\section{Starting Sparkling Water}

This section focuses on how Sparkling Water can be started in both backends and all language clients.
You can submit a Sparkling Water code as a Spark batch job or you can explore its functionality in an interactive shell.

Before you start, please make sure that you have downloaded Sparkling Water from \url{https://www.h2o.ai/download/} for
your desired Spark version.

\subsection{Setting up the Environment}

In the case of Scala, all dependencies are already provided inside the Sparkling Water artifacts.

In the case of Python, please make sure that you have the following Python packages installed:
\begin{itemize}
    \item requests
    \item tabulate
\end{itemize}.
Also please make sure that your Python environment is set-up to run regular Spark applications.

In the case of R, please make sure that SparklyR and H2O libraries are installed. The H2O version needs to match the version
used in Sparkling Water. For example, when using RSparkling 3.30.0.7, make sure that you have H2O of version 3.30.0.7 installed.

\subsection{Starting Interactive Shell with Sparkling Water}

To start interactive Scala shell, run:

\begin{lstlisting}[style=bash]
./bin/sparkling-shell
\end{lstlisting}

To start interactive Python shell, run:

\begin{lstlisting}[style=bash]
./bin/pysparkling
\end{lstlisting}

To use RSparkling in an interactive environment, we suggest using RStudio.

\subsection{Starting Sparkling Water in Internal Backend}

In the internal backend, the H2O cluster is created automatically during the call of \texttt{H2OContext.getOrCreate}. Since
it is not technically possible to get the number of executors in Spark, Sparkling Water tries to discover all executors
during the initiation of \texttt{H2OContext} and starts H2O instance inside each of discovered executors. This solution
is the easiest to deploy; however when Spark or YARN kills the executor, the whole H2O cluster goes down since H2O
does not support high availability. Also, there are cases where Sparkling Water is not able to discover all Spark
executors and will start just on the subset of executors. The shape of the cluster can not be changed later.
Internal backend is the default backend for Sparkling Water. It can be changed via spark configuration property
\texttt{spark.ext.h2o.backend.cluster.mode} to \textbf{external} or \textbf{internal}. Another way how to change the
type of backend is by calling \texttt{setExternalClusterMode()} or \texttt{setInternalClusterMode()} method
on \texttt{H2OConf} class instance. \texttt{H2OConf} is a simple wrapper around \texttt{SparkConf} and inherits all
properties in spark configuration.

\texttt{H2OContext} can be started as:

\begin{itemize}
    \item \textbf{Scala} \begin{lstlisting}[style=Scala]
import ai.h2o.sparkling._
val conf = new H2OConf().setInternalClusterMode()
val h2oContext = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{Python} \begin{lstlisting}[style=Python]
from pysparkling import *
conf = H2OConf().setInternalClusterMode()
h2oContext = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{R} \begin{lstlisting}[style=R]
library(sparklyr)
library(rsparkling)
spark_connect(master = "local", version = "3.0.0")
conf <- H2OConf()$setInternalClusterMode()
h2oContext <- H2OContext.getOrCreate(conf)
    \end{lstlisting}
\end{itemize}

If \texttt{spark.ext.h2o.backend.cluster.mode} property was set to \textbf{internal} either on the command line or
on the \texttt{SparkConf}, the following call is sufficient

\begin{itemize}
    \item \textbf{Scala} \begin{lstlisting}[style=Scala]
import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
    \end{lstlisting}
    \item \textbf{Python} \begin{lstlisting}[style=Python]
from pysparkling import *
h2oContext = H2OContext.getOrCreate()
    \end{lstlisting}
    \item \textbf{R} \begin{lstlisting}[style=R]
library(sparklyr)
library(rsparkling)
spark_connect(master = "local", version = "3.0.0")
h2oContext <- H2OContext.getOrCreate()
    \end{lstlisting}
\end{itemize}

\subsection{External Backend}

In the external cluster, we use the H2O cluster running separately from the rest of the Spark application. This separation
gives us more stability because we are no longer affected by Spark executors being killed, which can
lead (as in the previous mode) to h2o cluster being killed as well.

There are two deployment strategies of the external cluster: manual and automatic. In manual mode, we need to start
the H2O cluster, and in the automatic mode, the cluster is started for us automatically based on our configuration.
In Hadoop environments, the creation of the cluster is performed by a simple process called H2O driver.
When the cluster is fully formed, the H2O driver terminates. In both modes, we have to store a path of H2O driver jar
to the environment variable H2O\_DRIVER\_JAR.

\begin{lstlisting}[style=bash]
H2O_DRIVER_JAR=$(./bin/get-h2o-driver.sh some_hadoop_distribution)
\end{lstlisting}

\subsubsection{Automatic Mode of External Backend}

In the automatic mode, the H2O cluster is started automatically. The cluster can be started automatically only in YARN
environment at the moment. We recommend this approach, as it is easier to deploy external clusters in this mode,
and it is also more suitable for production environments. When the H2O cluster is started on YARN, it is started
as a map-reduce job, and it always uses the flat-file approach for nodes to cloud up.

First, get H2O driver, for example, for cdh 5.8, as:

\begin{lstlisting}[style=bash]
H2O_DRIVER_JAR=$(./bin/get-h2o-driver.sh cdh5.8)
\end{lstlisting}


To start an H2O cluster and connect to it, run:

\begin{itemize}
    \item \textbf{Scala} \begin{lstlisting}[style=Scala]
import ai.h2o.sparkling._
val conf = new H2OConf()
    .setExternalClusterMode()
    .useAutoClusterStart()
    .setH2ODriverPath("path_to_h2o_driver")
    .setClusterSize(1)
    .setExternalMemory("2G")
    .setYARNQueue("abc")
val hc = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{Python} \begin{lstlisting}[style=Python]
from pysparkling import *
conf = H2OConf()
    .setExternalClusterMode()
    .useAutoClusterStart()
    .setH2ODriverPath("path_to_h2o_driver")
    .setClusterSize(1)
    .setExternalMemory("2G")
    .setYARNQueue("abc")
hc = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{R} \begin{lstlisting}[style=R]
library(sparklyr)
library(rsparkling)
spark_connect(master = "local", version = "3.0.0")
conf <- H2OConf()
    $setExternalClusterMode()
    $useAutoClusterStart()
    $setH2ODriverPath("path_to_h2o_driver")
    $setClusterSize(1)
    $setExternalMemory("2G")
    $setYARNQueue("abc")
hc <- H2OContext.getOrCreate(conf)
    \end{lstlisting}
\end{itemize}

In case we stored the path of the driver H2O jar to environmental variable H2O\_DRIVER\_JAR, we do not need
to call \texttt{setH2ODriverPath} as Sparkling Water will read the path from the environmental variable.

When specifying the queue, we recommend that this queue has YARN preemption off to have a stable H2O cluster.

\subsubsection{Manual Mode of External Backend on Hadoop}

In the manual mode, we need to start the H2O cluster before connecting to it manually. At this section, we
will start the cluster on Hadoop.

First, get the H2O driver, for example, for cdh 5.8, as:

\begin{lstlisting}[style=bash]
H2O_DRIVER_JAR=$(./bin/get-h2o-driver.sh cdh5.8)
\end{lstlisting}

Also, set path to sparkling-water-assembly-extensions-2.12-all.jar which is bundled in Sparkling Water for Spark 3.0 archive.

\begin{lstlisting}[style=bash]
SW_EXTENSIONS_ASSEMBLY=/path/to/sparkling-water-3.30.0.7-1-3.0/jars/sparkling-water-assembly-extensions_2.12-3.30.0.7-1-2.4-all.jar
\end{lstlisting}

Let's start the H2O cluster on Hadoop:

\begin{lstlisting}[style=bash]
hadoop -jar $H2O_DRIVER_JAR -libjars $SW_EXTENSIONS_ASSEMBLY -sw_ext_backend -jobname test -nodes 3 -mapperXmx 6g
\end{lstlisting}

The \texttt{-sw\_ext\_backend} option is required as without it, the cluster won't allow Sparkling Water client to connect to it.

After this step, we should have an H2O cluster with 3 nodes running on Hadoop.

To connect to this external cluster, run the following commands:

\begin{itemize}
    \item \textbf{Scala} \begin{lstlisting}[style=Scala]
import ai.h2o.sparkling._
val conf = new H2OConf()
    .setExternalClusterMode()
    .useManualClusterStart()
    .setH2OCluster("representant_ip", representant_port)
    .setCloudName("test")
val hc = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{Python} \begin{lstlisting}[style=Python]
from pysparkling import *
conf = H2OConf()
    .setExternalClusterMode()
    .useManualClusterStart()
    .setH2OCluster("representant_ip", representant_port)
    .setCloudName("test")
hc = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{R} \begin{lstlisting}[style=R]
library(sparklyr)
library(rsparkling)
spark_connect(master = "local", version = "3.0.0")
conf <- H2OConf()
    $setExternalClusterMode()
    $useManualClusterStart()
    $setH2OCluster("representant_ip", representant_port)
    $setCloudName("test")
hc <- H2OContext.getOrCreate(conf)
    \end{lstlisting}
\end{itemize}


The \texttt{representant\_ip} and \texttt{representant\_port} should be IP and port of the leader node of the started
H2O cluster from the previous step.

\subsubsection{Manual Mode of External Backend without Hadoop (standalone)}

In the manual mode, we need to start the H2O cluster before connecting to it manually. At this section, we will start
the cluster as a standalone application (without Hadoop).

First, get the assembly H2O Jar:

\begin{lstlisting}[style=bash]
H2O_JAR=$(./bin/get-h2o-driver.sh standalone)
\end{lstlisting}

.. code:: bash


Also, set path to sparkling-water-assembly-extensions-2.12-all.jar which is bundled in Sparkling Water for Spark 3.0 archive.

\begin{lstlisting}[style=bash]
SW_EXTENSIONS_ASSEMBLY=/path/to/sparkling-water-3.30.0.7-1-3.0/jars/sparkling-water-assembly-extensions_2.12-3.30.0.7-1-2.4-all.jar
\end{lstlisting}

To start an external H2O cluster, run:

\begin{lstlisting}[style=bash]
java -cp "$H2O_JAR:$SW_EXTENSIONS_ASSEMBLY" water.H2OApp -allow_clients -name test -flatfile path_to_flatfile
\end{lstlisting}


where the flat-file content are lines in the format of ip:port of the nodes where H2O is supposed to run. To
read more about flat-file and its format, please
see \url{https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/H2O-DevCmdLine.md#flatfile}.


To connect to this external cluster, run the following commands:

\begin{itemize}
    \item \textbf{Scala} \begin{lstlisting}[style=Scala]
import ai.h2o.sparkling._
val conf = new H2OConf()
    .setExternalClusterMode()
    .useManualClusterStart()
    .setH2OCluster("representant_ip", representant_port)
    .setCloudName("test")
val hc = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{Python} \begin{lstlisting}[style=Python]
from pysparkling import *
conf = H2OConf()
    .setExternalClusterMode()
    .useManualClusterStart()
    .setH2OCluster("representant_ip", representant_port)
    .setCloudName("test")
hc = H2OContext.getOrCreate(conf)
    \end{lstlisting}
    \item \textbf{R} \begin{lstlisting}[style=R]
library(sparklyr)
library(rsparkling)
spark_connect(master = "local", version = "3.0.0")
conf <- H2OConf()
    $setExternalClusterMode()
    $useManualClusterStart()
    $setH2OCluster("representant_ip", representant_port)
    $setCloudName("test")
hc <- H2OContext.getOrCreate(conf)
    \end{lstlisting}
\end{itemize}

The \texttt{representant\_ip} and \texttt{representant\_port} need to be IP and port of the leader node of the started
H2O cluster from the previous step.

\subsection{Memory Management}

In the case of the internal backend, H2O resides in the same executor JVM as Spark and the memory provided for H2O is configured
via Spark.
Executor memory (i.e., memory available for H2O in internal backend) can be configured via the Spark configuration
property \texttt{spark.executor.memory}. For example, as:

\begin{lstlisting}[style=bash]
./bin/sparkling-shell --conf spark.executor.memory=5g
\end{lstlisting}

or configure the property in:

\begin{lstlisting}[style=bash]
$SPARK_HOME/conf/spark-defaults.conf
\end{lstlisting}

Driver memory (i.e., memory available for H2O client running inside the Spark driver) can be configured via the Spark
configuration property \texttt{spark.driver.memory}. For example, as:

\begin{lstlisting}[style=bash]
./bin/sparkling-shell --conf spark.driver.memory=5g
\end{lstlisting}

or configure the property in:

\begin{lstlisting}[style=bash]
$SPARK_HOME/conf/spark-defaults.conf
\end{lstlisting}

In the external backend, only the H2O client (Scala only) is running in the Spark driver and is affected by Spark
memory configuration. Memory has to be configured explicitly for the H2O nodes in the external backend via the
\texttt{spark.ext.h2o.external.memory} option or \texttt{setExternalMemory} setter on \texttt{H2OConf}.

For YARN-specific configuration, refer to the Spark documentation \url{https://spark.apache.org/docs/latest/running-on-yarn.html}.
